-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issue 151 #157
Fix issue 151 #157
Conversation
c0d50c2
to
dcc10be
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added some technical comments but I still don't understand the logic.
Could you add a comment at the start of scan_columns()
to explain please?
E.g.,:
- what about numeric values encoded as factors?
- can a value be of type logical but in reality be a date?
- are you sure dates won't be counted twice?
as.numeric(as.Date("2020-01-01"))
#> [1] 18262
Created on 2024-07-23 with reprex v2.1.1
Please add more examples & tests with more diverse column types.
R/clean_data_helpers.R
Outdated
tmp <- suppressWarnings(as.numeric(x)) | ||
are_numeric <- round((sum(!is.na(tmp)) / n_rows), 6L) | ||
are_numeric <- 0L | ||
verdict <- type == "logical" | type == "factor" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
verdict <- type == "logical" | type == "factor" | |
verdict <- type == "logical" || type == "factor" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I accounted for this in commit 828534d
.
R/clean_data_helpers.R
Outdated
# The percent of numeric and character value will be set automatically to 0 as | ||
# to prevent from the effects of the conversion to numeric and character. | ||
# | ||
types <- as.character(vapply(data, class, character(1L))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to use class()
or typeof()
here?
types <- as.character(vapply(data, class, character(1L))) | |
types <- vapply(data, class, character(1L)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class()
will gives us the inofrmation that we need - the composition in data types of every column.
typeof()
will tell us date values are double, numeric. This is not what we want to spot out from the scan_data()
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But class()
doesn't always return an output of length 1 so this will error.
You will see this with a test including POSIXct
columns for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still believe that typeof()
is not appropriate for this task.
I have made changes to account for multiple types in commit db420c8
.
Please add more examples & tests with more diverse column types. In particular, there is still an issue with dates. |
I have tested the function across multiple datasets. If you have a dataset with odd cases that are not accounted for yet, please share this with me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is going in the right direction.
The code is getting more and more complicated, which will inevitably lead to more bugs.
I would recommend:
- centering the different case treatment on
typeof()
which has a finite, known, set of outputs - think about what possible
scan_data()
types are possible for eachtypeof()
and create customscan()
function for each type. For example:- it seems that logical can only lead to logical and NA, why is it grouped with factors?
- character are on the other extreme of the spectrum since something that appears as a character can actually be any of the types recognized by
scan_data()
It may be easier to start from a blank page to not be influenced by the current code, which seems to push toward more and more complexity.
Pen & paper design, with a flow diagram of the different pathways, may help.
R/clean_data_helpers.R
Outdated
get_class <- function(x) { | ||
paste(class(x), collapse = ",") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, it's a bad idea to manipulate objects / code as text, especially to circumvent explicit safeguards
R/clean_data_helpers.R
Outdated
are_date <- 0L | ||
if (any(type %in% c("POSIXct", "POSIXt"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This solution doesn't scale as we can't possibly add special cases for every existing types. Especially since packages can define their own types.
I had identified two special cases (logical abd factors) that I wanted to resolve using their level. But can be splitted into two small functions.
Character columns can contain data of any types. If you start by determining the proportion of characters in this function, most columns will appear to only contain characters. Detecting the other types first, then assigning the remaining to characters was the best alternative I could think of. Happy to explore any other suggestion around this.
I will detach myself from this for a day or two. I will give it a try with |
Maybe as a first question to determine the best way forward: what is the exact role of this function in a data cleaning pipeline? How is its output used to inform following steps? Why these specific types have been selected as potential types recognized by |
If by
For example, a column with both numeric and character values might require a call for
The selected types are:
These atomic vector types are built on top of the 4 basic types ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most important criteria in all the code we produce is that it should be clear and maintainable.
When you write code, please go through it to double check if that is the case.
In general, this will happen by using standard mechanisms (e.g., to test class) and by being explicit about the values of object. E.g., if the value is 1, let's write one.
If standard patterns or mechanisms are unknown to you, a good strategy is to start following a GitHub repo of a known package and look at the changes they are making.
R/clean_data_helpers.R
Outdated
names(scan_result) <- c("missing", "numeric", "date", "character", | ||
"logical", "date-time", "factor") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this change? It doesn't give the impression that the list results from an explicit and motivated design choice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this function is now mostly returning the class()
. Can we go back to the drawing board and write a list of requirements:
- when I have X type of data, I want to know Y to inform Z
For example, a column with both numeric and character values might require a call for convert_to_numeric(), or a conversion of the numeric values into character, or setting the values of one of these type to NA. With this information in hand, the user can plan the set of cleaning operations to apply on the dataset.
This is a good example for analyzing types in character
columns. What about the others?
@Bisaloo - Thanks for taking the time to review this. I disagree with some comments above. Let's meet tomorrow to clarify our conceptions of the function and remove all ambuguities around it. |
272cb8b
to
83a9150
Compare
R/clean_data_helpers.R
Outdated
|
||
# --- get the proportion of character values --- | ||
are_character <- round((1.0 - (are_na + are_numeric + | ||
are_date + are_logical)), 6L) | ||
# get the character count | ||
character_count <- initial_length - | ||
(na_count + logical_count + numeric_count + date_count) | ||
|
||
# --- return the output --- | ||
return(c(are_na, are_numeric, are_date, are_character, are_logical)) | ||
} | ||
# transform into proportions | ||
props <- round( | ||
c(na_count, numeric_count, date_count, character_count, logical_count) / | ||
initial_length, 4L) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking back about this: are we sure we want to return a proportion? For example, a value could at the same time be recognized as a numeric or a potential date. Rather than guessing which one it is, shouldn't this be left to the user, who has all the context to make an informed choice?
This means the sum of the return vector could be more than 1 as we are counting some values in two categories but maybe that's not an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The guessing part only happens when Date
values are identified from the parsing using lubridate::parse_date_time()
. Otherwise, all numeric values are strictly considered as numeric. I think, the following could help users:
- Add a note about this in the function description
- Add the algorithm used within the function in the
details
section of the function documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this in mind, I believe my question is still valid: should we do any guessing at all? If we don't have a strong reason to, I'd suggest we don't.
For example, if we have "20210107", should it be a date or a numeric? For a Covid-19 dataset, it could be either. It's probably safer to leave the final choice to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is another proposition from @Degoot-AM
I would add a data type called ambiguous
and assigned to it non-differentiable entries and that would add the sum of row to 1 (100%).
Thanks for quoting these. |
To clarify, I am not talking about proportion vs percentage. I'm saying that we could potentially not sum to 1 (or 100%). Given the use case you described, maybe we don't care if we double count some values in different categories (e.g. We should maybe(?) return a simple "can it be converted to...?" without trying to rely on complex heuristics to differentiate between date & numeric. With this, the user can decide if the entire column should be date or numeric. Does this make more sense? |
This is related to the opened conversation above. Can we follow up this topic in that conversation? |
R/clean_data_helpers.R
Outdated
|
||
# --- get the proportion of character values --- | ||
are_character <- round((1.0 - (are_na + are_numeric + | ||
are_date + are_logical)), 6L) | ||
# get the character count | ||
character_count <- initial_length - | ||
(na_count + logical_count + numeric_count + date_count) | ||
|
||
# --- return the output --- | ||
return(c(are_na, are_numeric, are_date, are_character, are_logical)) | ||
} | ||
# transform into proportions | ||
props <- round( | ||
c(na_count, numeric_count, date_count, character_count, logical_count) / | ||
initial_length, 4L) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this in mind, I believe my question is still valid: should we do any guessing at all? If we don't have a strong reason to, I'd suggest we don't.
For example, if we have "20210107", should it be a date or a numeric? For a Covid-19 dataset, it could be either. It's probably safer to leave the final choice to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left two minor comments inline.
I think we're almost there:
- let's make sure all the points discussed throughout this process are in the design vignette
- let's finalize the conversation of double counting
I have accounted for the documentation and double counting in commit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is behaving as intended:
scan_in_character("20210702", "test")
should return c(0, 1, 1, 0, 0)
scan_in_character(c("20210702", "2021/07/03", "3"), "test")
should return c(0, 0.6666, 0.6666, 0, 0)
It would probably be helpful to add tests that test exact results, in a stricter fashion that just testing class or length.
R/clean_data_helpers.R
Outdated
# --- get the proportion of logical values --- | ||
are_logical <- round((sum(is.logical(x)) / n_rows), 6L) | ||
# getting the date and numeric count as describe above | ||
date_count <- numeric_count <- 0L |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
date_count <- numeric_count <- 0L | |
date_count <- 0L |
Since the value of numeric_count
will be set in all possible branches
Co-authored-by: Hugo Gruson <10783929+Bisaloo@users.noreply.github.com>
…for similar occurence of `six (06)`
d70b13f
to
356a143
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't post it as a suggestion but I believe this much simpler implementation should work:
scan_in_character <- function(x, x_name) {
# There might be, within a character column, values of type:
# character, numeric, date (date or date-time), NA, and logical
# In this function, we check the presence of these different types within a
# character column.
# save the variable length
# the character count is decreased by the number of occurrence a different
# data type is found.
initial_length <- character_count <- length(x)
# get the count of missing data (NA)
na_count <- sum(is.na(x))
x <- x[!is.na(x)]
character_count <- character_count - na_count
# We will check if there is any Date values within the variable by parsing the
# values, looking for the ones that fit any of the predefined format.
# When there is one or more Date values, they will be converted into
# numeric. The first numeric count is recorded at this point. The rest of the
# values are converted into numeric, and the second count of numeric is
# recorded. They will in turn be converted into date.
# If any of these numeric values is a Date (a numeric, which
# after conversion to Date, fall within the interval
# [50 years back from today's date, today's date]), it will add to the second
# date count.
# That way the Date count is the count of date identified from the
# parsing + the count of Dates within the numeric values. Similarly, the
# numeric count is the count of numeric values within dates values and count
# among the non-date values.
#
# NOTE: This is what justifies that the sum across rows in the output object
# could be > 1.
#
# When there is no Date values identified from the parsing, the variable
# is converted into numeric. The final numeric count is the sum of all the
# identified numeric values.
# The logical count is the number of TRUE and FALSE written in both lower
# and upper cases within the variable.
# The remaining values will be considered of type character.
# parsing the vector, looking for date values
are_date <- suppressWarnings(
as.Date(
lubridate::parse_date_time(
x,
orders = c("ymd", "ydm", "dmy", "mdy", "myd", "dym", "Ymd", "Ydm",
"dmY", "mdY", "mYd", "dYm")
)
)
)
date_count <- sum(!is.na(are_date))
character_count <- character_count - date_count
# convert everything to numeric and get the numeric count
are_numeric <- suppressWarnings(as.numeric(x))
numeric_count <- sum(!is.na(are_numeric))
numeric_and_date_count <- sum(!is.na(are_date) & !is.na(are_numeric))
if (numeric_and_date_count > 0) {
cli::cli_alert_warning(
"Found {numeric_and_date_count} value{?s} that can be both numeric and \\
date in column `{x_name}`"
)
}
# We don't want to remove numeric_and_date values twice
character_count <- character_count - numeric_count + numeric_and_date_count
# get logical count
logicals <- toupper(x) == "TRUE" | toupper(x) == "FALSE"
logical_count <- sum(logicals)
character_count <- character_count - logical_count
# transform into proportions
props <- round(
c(na_count, numeric_count, date_count, character_count, logical_count) /
initial_length, 4L
)
return(props)
}
Please let me know if you prefer that I push it to this branch or as a PR so you can review it more easily.
cli::cli_alert_warning( | ||
"Found {ambiguous_count} values that can be either numeric or date in", | ||
"column `{x_name}`" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This results in an incomplete sentence:
dat <- data.frame(col1 = c("20210702", "test"))
scan_result <- cleanepi::scan_data(data = dat)
#> ! Found 1 values that can be either numeric or date in
Created on 2024-09-02 with reprex v2.1.1
Let's also handle pluralization while we're at it
cli::cli_alert_warning( | |
"Found {ambiguous_count} values that can be either numeric or date in", | |
"column `{x_name}`" | |
) | |
cli::cli_alert_warning( | |
"Found {ambiguous_count} value{?s} that can be either numeric or date \\ | |
in column `{x_name}`" | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Bisaloo - Please create a PR.
# 2 date values (one of them `20210702` is also numeric) | ||
# 2 numeric values in which one is also a date value, hence the | ||
# warning about the presence of ambiguous data | ||
dat <- data.frame(col1 = c(c("20210702", "2021/07/03", "3"), "test")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dat <- data.frame(col1 = c(c("20210702", "2021/07/03", "3"), "test")) | |
dat <- data.frame(col1 = c("20210702", "2021/07/03", "3", "test")) |
Closing this PR. The requested changes have been implemented on fix-issue-151-simplification`, which is now merged with main. |
This PR contains changes to adress issue #151